Abstract

Can humans fly? Emphatically no. Can cars eat? Again, absolutely not.
Yet these absurd inferences result from the current disregard for
particular types of actors in action understanding. There is no work we
know of on simultaneously inferring actors and actions in video, not to
mention a dataset to experiment with. Our paper hence marks the first
effort in the computer vision community to jointly consider various
types of actors undergoing various actions. To start on the problem, we
collect a dataset of 3782 videos from YouTube and label both actors and
actions at the pixel level in each video. We formulate the general
actor-action understanding problem and instantiate it at various
granularities: video-level single- and multiple-label actor-action
recognition and pixel-level actor-action semantic segmentation. Our
experiments demonstrate that inference jointly over actors and actions
outperforms inference independently over them, which supports our
argument for the value of explicitly considering various actors in
comprehensive action understanding.

1.  Introduction 

Like verbs in language, action is the heart of video understanding. As
such, it has received a significant amount of attention in the last
decade. Our community has moved from small datasets of a handful of
actions [12, 50] to large datasets with many dozens of actions [27, 45];
from constrained domains like sporting [42, 46] to videos in-the-wild
[38, 45]. Notable methods have demonstrated that low-level features
[25, 33, 58, 59], mid-level atoms [67], high-level exemplars [48],
structured models [42, 56], and attributes [37] can be used for action
recognition. Impressive methods have even pushed toward action
recognition for multiple views [40], event recognition [20], group-based
activities [32], and even human-object interactions [15, 44].

However, these many works emphasize a small subset of the broader action
understanding problem. First, aside from Iwashita et al. [19], who study
egocentric animal activities, these existing methods all assume the
agent of the action, which we call the actor, is a human adult. The only
work we are aware of that jointly considers different types of actors
and actions is Xu et al. [61], but this work uses a dataset two orders
of magnitude smaller (32 videos versus 3782 videos), groups all animals
together into one class separate from humans, and is primarily a
visual-psychophysical study using off-the-shelf vision methods.

Figure 1. Montage of labeled videos in our new actor-action dataset,
A2D. Examples of single actor-action instances as well as multiple
actors doing different actions are present in this montage. Label colors
are picked from the HSV color space, so that the same objects have the
same hue (refer to Fig. 2 for the color legend). Black is the
background. View zoomed and in color.

Although looking at people is certainly a relevant application domain
for computer vision, it is not the only one; consider recent advances in
video-to-text [1, 14] that can be used for semantic indexing of large
video databases [36], or advances in autonomous vehicles [11]. In these
applications, understanding both the actor and the action is critical
for success: e.g., the autonomous vehicle needs to distinguish between a
child, a deer and a squirrel running into the road so it can accurately
make an avoidance plan. Applications like these, e.g., robotic autonomy
[55], are abundant and growing.

Second, these works largely focus on action recognition, which is posed
as the classification of a temporally pre-trimmed clip into one of k
action classes from a closed world. The direct utility of results based
on this problem formulation is limited. The community has indeed begun
to move beyond this simplified problem into action detection [56, 65],
action localization [22, 39], action segmentation [23, 24], and
actionness ranking [8]. But all of these works do so strictly in the
context of human actors.

In this paper, we overcome both of these narrow viewpoints and introduce
a new level of generality to the action understanding problem by
considering multiple different classes of actors undergoing multiple
different classes of actions. To be exact, we consider seven actor
classes (adult, baby, ball, bird, car, cat, and dog) and eight action
classes (climb, crawl, eat, fly, jump, roll, run, and walk) not
including the no-action class, which we also consider. We formulate a
general actor-action understanding framework and implement it for three
specific problems: actor-action recognition with single- and
multiple-label, and actor-action semantic segmentation. These three
problems cover different levels of modeling and hence allow us to
analyze the new problem thoroughly. We further distinguish our work from
multi-task learning [5], which focuses on getting a shared
representation for training better classifiers, whereas we focus on
modeling the relationship and interactions of the actor and action under
a unified graphical model.

To support these new actor-action understanding problems, we have
created a new dataset, which we call the Actor-Action Dataset or A2D
(see Fig. 1), that is labeled at the pixel level for actors and actions
(densely in space over actors, sparsely in time). The A2D has 3782
videos with at least 99 instances per valid actor-action tuple (Sec. 3
and Fig. 2 have exact statistics). We thoroughly analyze empirical
performance of both state-of-the-art and baseline methods, including
naive Bayes (independent over actor and action), a joint product-space
model (each actor-action pair is considered as one class), and a bilayer
graphical model inspired by [31] that connects actor nodes with action
nodes.

Our experiments demonstrate that inference jointly over actors and
actions outperforms inference independently over them, and hence
supports the explicit consideration of various actors in comprehensive
action understanding. In other words, although a bird and an adult can
both eat, the space-time appearances of a bird eating and an adult
eating differ in significant ways. Furthermore, the various mannerisms
of the way birds eat and adults eat mutually reinforce inference over
the constituent parts. This result is analogous to Sadeghi and Farhadi's
visual phrases work [49], in which it is demonstrated that joint
detection over small groups of objects in images is more robust than
separate detection over each object followed by a merging process, and
to Gupta et al.'s [15] work on human-object interactions, in which
considering specific objects while modeling human actions leads to
better inferences for both parts.

Our paper marks the first effort in the computer vision community to
jointly consider various types of actors undergoing various actions. As
such, we pose two goals: first, we seek to formulate the general
actor-action understanding problem and instantiate it at various
granularities, and second, we seek to assess whether or not it is
beneficial to explicitly jointly consider actors and actions in this new
problem space. The paper describes the new A2D dataset (Sec. 3), the
actor-action problem formulation (Sec. 4) and our experiments to answer
this question (Sec. 5).

2.  Related  Work 

A  related  work  from  the  action  recognition  community 
is  the  recent  Bojanowski  et  al.  [2]  paper,  which  focuses  on 
finding  different  human  actors  in  movies,  but  these  are  the 
actor-names  and  not  different  types  of  actors,  like  dog  and 
cat  as  we  consider  in  this  paper.  Similarly,  the  existing  work 
on  actions  and  objects,  such  as  [15,  44],  is  strictly  focused 
on  interaction  between  human  actors  manipulating  various 
objects  and  not  different  types  of  actors,  which  is  our  focus. 

The remainder of the related work section discusses segmentation, which
is a major emphasis of our broader view of the action understanding
problem space and yet was not discussed in the introduction (Sec. 1).
Semantic segmentation methods can now densely label more than a dozen
classes in images [13, 29, 30, 41] and videos [21, 57] undergoing rapid
motion; semantic segmentation methods have even been unified with object
detectors and scene classification [63], extended to 3D [18, 28, 53] and
posed jointly with attributes [66], stereo [9, 31, 51] and SFM [4, 10].
Although the underlying optimization problems in these methods tend to
be expensive, average-per-class accuracy scores have significantly
increased, for example, from 67% in [52] to nearly 80% in [26, 29, 63]
on the popular MSRC semantic segmentation benchmark. Further works have
moved beyond full supervision to weakly supervised object discovery and
learning [16, 54].

Other related works include unsupervised video object segmentation
[34, 35, 43, 64] and joint temporal segmentation with action recognition
[17]. These video object segmentation methods are class-independent and
assume a single dominant object (actor) in the video; they are hence not
directly comparable to our work, although one can foresee a potential
method using video object segmentation as a precursor to the
actor-action understanding problem.

There is a clear trend moving toward video semantic segmentation and
toward weak supervision. But these existing works in semantic
segmentation focus on labeling pixels/voxels as various objects or
background-stuff classes. They do not consider the joint label space of
what actions these "objects" may be doing. Our work differs from them by
directly considering this actor-action problem, while also building on
the various advances made in these papers.

Figure 2. Statistics of label counts in the new A2D dataset. We show the
number of videos in our dataset in which a given [actor, action] label
occurs. Empty entries are joint labels that are not in the dataset
either because they are invalid (a ball cannot eat) or were in
insufficient supply, such as for the case dog-climb. The background
color in each cell depicts the color we use throughout the paper; we
vary hue for actor and saturation for action.

3. A2D: The Actor-Action Dataset

We have collected a new dataset consisting of 3782 videos from YouTube;
these videos are hence unconstrained "in-the-wild" videos with varying
characteristics. Figure 1 has single-frame examples of the videos. We
select seven classes of actors performing eight different actions. Our
choice of actors covers articulated ones, such as adult, baby, bird, cat
and dog, as well as rigid ones, such as ball and car. The eight actions
are climbing, crawling, eating, flying, jumping, rolling, running, and
walking. A single action class can be performed by various actors, but
none of the actors can perform all eight actions. For example, we do not
consider adult-flying or ball-running in the dataset. In some cases, we
have pushed the semantics of the given action term to maintain a small
set of actions: e.g., car-running means the car is moving and
ball-jumping means the ball is bouncing. One additional action label
none is added to account for actions other than the eight listed ones as
well as actors in the background that are not performing an action.
Therefore, we have in total 43 valid actor-action tuples.

To query the YouTube database, we use various text searches generated
from actor-action tuples. Resulting videos were then manually verified
to contain an instance of the primary actor-action tuple, and
subsequently temporally trimmed to contain that actor-action instance.
The trimmed videos have an average length of 136 frames, with a minimum
of 24 frames and a maximum of 332 frames. We split the dataset into 3036
training videos and 746 testing videos divided evenly over all
actor-action tuples. Figure 2 shows the statistics for each actor-action
tuple. One-third of the videos in A2D have more than one actor
performing different actions, which further distinguishes our dataset
from most action classification datasets. Figure 3 shows exact counts
for these cases with multiple actors and actions.

Figure 3. Histograms of counts of joint actor-actions, and individual
actors and actions per video in A2D; roughly one-third of the videos
have more than one actor and/or action.

To support the broader set of action understanding problems in
consideration, we label three to five frames for each video in the
dataset with both dense pixel-level actor and action annotations (Fig. 1
has labeling examples). The selected frames are evenly distributed over
a video. We start by collecting crowd-sourced annotations from MTurk
using the LabelMe toolbox [47], then we manually filter each video to
ensure the labeling quality as well as the temporal coherence of labels.
Video-level labels are computed directly from these pixel-level labels
for the recognition tasks. To the best of our knowledge, this dataset is
the first video dataset that contains both actor and action pixel-level
labels.

4. Actor-Action Understanding Problems

Without loss of generality, let $\mathcal{V} = \{v_1, \dots, v_n\}$
denote a video with $n$ voxels in space-time lattice $\Lambda^3$ or $n$
supervoxels in a video segmentation [7, 60, 62] represented as a graph
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where the neighborhood
structure of the graph is given by the supervoxel segmentation method;
when necessary we write $\mathcal{E}(v)$ where $v \in \mathcal{V}$ to
denote the subset of $\mathcal{V}$ that are neighbors with $v$. We use
$\mathcal{X}$ to denote the set of actor labels: {adult, baby, ball,
bird, car, cat, dog}, and we use $\mathcal{Y}$ to denote the set of
action labels: {climbing, crawling, eating, flying, jumping, rolling,
running, walking, none¹}.

Consider a set of random variables $\mathbf{x}$ for actor and another
$\mathbf{y}$ for action; the specific dimensionality of $\mathbf{x}$ and
$\mathbf{y}$ will be defined later. Then, the general actor-action
understanding problem is specified as a posterior maximization:

$$(\mathbf{x}^*, \mathbf{y}^*) = \arg\max_{\mathbf{x}, \mathbf{y}} P(\mathbf{x}, \mathbf{y} \mid \mathcal{V}) . \tag{1}$$

Specific instantiations of this optimization problem give rise to
various actor-action understanding problems, which we specify next, and
specific models for a given instantiation will vary the underlying
relationship between $\mathbf{x}$ and $\mathbf{y}$, allowing us to
deeply understand their interplay.

¹ The none action means either there is no action present or the action
is not one of those we have considered.

4.1.  Single-Label  Actor-Action  Recognition 

This is the coarsest level of granularity we consider in the paper and
it instantiates the standard action recognition problem [33]. Here,
$\mathbf{x}$ and $\mathbf{y}$ are simply scalars $x$ and $y$,
respectively, depicting the single actor and action label to be
specified for a given video $\mathcal{V}$. We consider three models for
this case:

Naive Bayes: Assume independence across actions and actors, and then
train a set of classifiers over actor space $\mathcal{X}$ and a separate
set of classifiers over action space $\mathcal{Y}$. This is the simplest
approach and is not able to enforce actor-action tuple existence: e.g.,
it may infer adult-fly for a test video.

Joint Product Space: Create a new label space $\mathcal{Z}$ that is the
joint product space of actors and actions:
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Directly learn a
classifier for each actor-action tuple in this joint product space.
Clearly, this approach enforces actor-action tuple existence, and we
expect it to be able to exploit cross-actor-action features to learn
more discriminative classifiers. However, it may not be able to exploit
the commonality across different actors or actions, such as the similar
manner in which a dog and a cat walk.

Trilayer: The trilayer model unifies the naive Bayes and the joint
product space models. It learns classifiers over the actor space
$\mathcal{X}$, the action space $\mathcal{Y}$ and the joint actor-action
space $\mathcal{Z}$. During inference, it separately infers the naive
Bayes terms and the joint product space terms and then takes a linear
combination of them to yield the final score. It models not only the
cross-actor-action characteristics but also the common characteristics
among the same actor performing different actions as well as different
actors performing the same action.

In all cases, we extract local features (see Sec. 5.1 for details) and
train a set of one-vs-all classifiers, as is standard in contemporary
action recognition methods; although not strictly probabilistic, their
scores can be interpreted as such to implement Eq. 1.
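
The difference between the three single-label models can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the score dictionaries stand in for one-vs-all classifier outputs, the function names are hypothetical, and the mixing weight `alpha` of the linear combination is an assumed value.

```python
def naive_bayes(actor_scores, action_scores):
    """Pick actor and action independently; the pair may be an invalid tuple."""
    actor = max(actor_scores, key=actor_scores.get)
    action = max(action_scores, key=action_scores.get)
    return actor, action


def joint_product_space(joint_scores):
    """Pick the best tuple among those that have a trained classifier."""
    return max(joint_scores, key=joint_scores.get)


def trilayer(actor_scores, action_scores, joint_scores, alpha=0.5):
    """Linear combination of independent and joint terms (alpha is assumed)."""
    def combined(pair):
        actor, action = pair
        independent = actor_scores[actor] + action_scores[action]
        return alpha * independent + (1.0 - alpha) * joint_scores[pair]
    # Only valid tuples (keys of joint_scores) are candidates.
    return max(joint_scores, key=combined)


# Hypothetical classifier scores for one test video.
actor_scores = {"adult": 0.9, "bird": 0.6}
action_scores = {"fly": 0.8, "walk": 0.7}
joint_scores = {("adult", "walk"): 0.5, ("bird", "fly"): 0.6, ("bird", "walk"): 0.2}

nb = naive_bayes(actor_scores, action_scores)
jps = joint_product_space(joint_scores)
tri = trilayer(actor_scores, action_scores, joint_scores)
```

With these toy scores the independent model returns the invalid tuple adult-fly, whereas the two models that score the product space can only return tuples that exist in it.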

4.2. Multi-Label Actor-Action Recognition

As noted in Fig. 3, about one-third of the videos in A2D have more than
one actor and/or action present in a given video. In many realistic
video understanding applications, we find such multiple-label cases. We
address this explicitly by instantiating Eq. 1 for the multi-label case.
Here, $\mathbf{x}$ and $\mathbf{y}$ are binary vectors of dimension
$|\mathcal{X}|$ and $|\mathcal{Y}|$ respectively. $x_i$ takes value 1 if
the $i$th actor type is present in the video and zero otherwise. We
define $\mathbf{y}$ similarly. This general definition, which does not
tie specific elements of $\mathbf{x}$ to those in $\mathbf{y}$, is
necessary to allow us to compare independent multi-label performance
over actors and actions with that of the actor-action tuples. We again
consider a naive Bayes pair of multi-label actor and action classifiers,
multi-label actor-action classifiers over the joint product space, as
well as the trilayer model that unifies the above classifiers.

Figure 4. Visualization of different graphical models, panels (a)
through (d), to solve Eq. 1. The figure here is for simple illustration
and the actual voxel or supervoxel graph is built for a video volume.
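
As a small illustration of the multi-label encoding, the following Python sketch builds the binary actor and action vectors from the tuples present in a video. The helper name `multilabel_vectors` and the ordering of the label lists are assumptions made for this example.

```python
ACTORS = ["adult", "baby", "ball", "bird", "car", "cat", "dog"]
ACTIONS = ["climbing", "crawling", "eating", "flying", "jumping",
           "rolling", "running", "walking", "none"]


def multilabel_vectors(tuples_in_video):
    """Return binary presence vectors (x, y) for a list of (actor, action) tuples."""
    actors_present = {actor for actor, _ in tuples_in_video}
    actions_present = {action for _, action in tuples_in_video}
    x = [1 if a in actors_present else 0 for a in ACTORS]
    y = [1 if b in actions_present else 0 for b in ACTIONS]
    return x, y


# Two videos with different actor-action tuples but identical (x, y).
x1, y1 = multilabel_vectors([("dog", "running"), ("adult", "walking")])
x2, y2 = multilabel_vectors([("dog", "walking"), ("adult", "running")])
```

Because elements of the actor vector are not tied to elements of the action vector, the two videos above receive the same pair of vectors; only labels in the joint product space tell them apart.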

4.3. Actor-Action Semantic Segmentation

Semantic segmentation is the most fine-grained instantiation of
actor-action understanding that we consider, and it subsumes other
coarser problems like detection and localization, which we do not
consider in this paper for lack of space. Here, we seek a label for
actor and action per voxel over the entire video. Define the two sets of
random variables $\mathbf{x} = \{x_1, \dots, x_n\}$ and
$\mathbf{y} = \{y_1, \dots, y_n\}$ to have dimensionality in the number
of voxels or supervoxels, and assign each $x_i \in \mathcal{X}$ and each
$y_i \in \mathcal{Y}$. The objective function in Eq. 1 remains the same,
but the way we define the graphical model implementing
$P(\mathbf{x}, \mathbf{y} \mid \mathcal{V})$ leads to acutely different
assumptions on the relationship between actor and action variables.

We explore this relationship in the remainder of this section. We start
by again introducing a naive Bayes-based model that treats the two
classes of labels separately, and a joint product space model that
considers actors and actions together in a tuple $[x, y]$. We then
explore a bilayer model, inspired by Ladicky et al. [31], that considers
the inter-set relationship between actor and action variables. Finally,
we introduce a new trilayer model that considers both intra- and
inter-set relationships. Figure 4 illustrates these various graphical
models. We then evaluate the performance of all models in terms of joint
actor and action labeling in Sec. 5.

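Naive Bayes-based Model First, let us consider a naive Bayes-based
model, similar to the one used for actor-action recognition earlier:

$$
\begin{aligned}
P(\mathbf{x}, \mathbf{y} \mid \mathcal{V})
&= P(\mathbf{x} \mid \mathcal{V})\, P(\mathbf{y} \mid \mathcal{V}) \\
&= \prod_{i \in \mathcal{V}} P(x_i)\, P(y_i)
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} P(x_i, x_j)\, P(y_i, y_j) \\
&\propto \prod_{i \in \mathcal{V}} \phi_i(x_i)\, \psi_i(y_i)
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} \phi_{ij}(x_i, x_j)\, \psi_{ij}(y_i, y_j)
\end{aligned}
\tag{2}
$$

where $\phi_i$ and $\psi_i$ encode the separate potential functions
defined on actor and action nodes alone, respectively, and $\phi_{ij}$
and $\psi_{ij}$ are the pairwise potential functions within sets of
actor nodes and sets of action nodes, respectively.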

We train classifiers $\{f_c \mid c \in \mathcal{X}\}$ over actors and
$\{g_c \mid c \in \mathcal{Y}\}$ on sets of actions using features
described in Sec. 5.3, and $\phi_i$ and $\psi_i$ are the classification
scores for supervoxel $i$. The pairwise edge potentials have the form of
a contrast-sensitive Potts model [3]:

$$
\phi_{ij} =
\begin{cases}
1 & \text{if } x_i = x_j \\
\exp\left(-\theta / (1 + \chi^2_{ij})\right) & \text{otherwise,}
\end{cases}
\tag{3}
$$

where $\chi^2_{ij}$ is the $\chi^2$ distance between feature histograms
of nodes $i$ and $j$, $\theta$ is a parameter to be learned from the
training data, and $\psi_{ij}$ is defined analogously. Actor-action
semantic segmentation is obtained by solving these two flat CRFs
independently.
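
The contrast-sensitive Potts potential can be sketched as follows. This is a minimal sketch under two assumptions that are not spelled out in the text above: the potential is taken to be 1 when the two neighboring labels agree, and the chi-square distance uses the common symmetric form with a factor of one half. The function names and the small `eps` guard are hypothetical.

```python
import math


def chi2_distance(h1, h2, eps=1e-12):
    """Symmetric chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))


def potts_potential(label_i, label_j, hist_i, hist_j, theta):
    """Pairwise potential for neighboring supervoxels i and j."""
    if label_i == label_j:
        # Assumed: agreeing labels incur no penalty.
        return 1.0
    # Disagreeing labels are penalized more when the two nodes look alike.
    return math.exp(-theta / (1.0 + chi2_distance(hist_i, hist_j)))


similar = potts_potential("cat", "dog", [0.5, 0.5], [0.5, 0.5], theta=1.0)
dissimilar = potts_potential("cat", "dog", [1.0, 0.0], [0.0, 1.0], theta=1.0)
```

A label change between two supervoxels with nearly identical histograms receives a smaller potential than a label change across a strong appearance boundary, which is the contrast-sensitive behavior the model relies on.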

Joint Product Space We consider a new set of random variables
$\mathbf{z} = \{z_1, \dots, z_n\}$ defined again on all supervoxels in a
video and take labels from the actor-action product space
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. This formulation jointly
captures the actor-action tuples as unique entities but cannot model the
common actor and action behaviors among different tuples as later models
below do; we hence have a single-layer graphical model:

$$
\begin{aligned}
P(\mathbf{x}, \mathbf{y} \mid \mathcal{V}) = P(\mathbf{z} \mid \mathcal{V})
&= \prod_{i \in \mathcal{V}} P(z_i)
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} P(z_i, z_j) \\
&\propto \prod_{i \in \mathcal{V}} \varphi_i(z_i)
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} \varphi_{ij}(z_i, z_j) \\
&= \prod_{i \in \mathcal{V}} \varphi_i([x_i, y_i])
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} \varphi_{ij}([x_i, y_i], [x_j, y_j]) ,
\end{aligned}
\tag{4}
$$

where $\varphi_i$ is the potential function for the joint actor-action
product space label, and $\varphi_{ij}$ is the inter-node potential
function between nodes with the tuple $[x, y]$. To be specific,
$\varphi_i$ contains the classification scores on the node $i$ from
running trained actor-action classifiers
$\{h_c \mid c \in \mathcal{Z}\}$, and $\varphi_{ij}$ has the same form
as Eq. 3. Fig. 4 (b) illustrates this model as a one-layer CRF defined
on the actor-action product space.
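
To make the single-layer model concrete, the sketch below evaluates the logarithm of the unnormalized product in Eq. 4 for a labeling of a tiny three-node chain and finds the best labeling by exhaustive search. It is an illustration only: the unary scores, the constant pairwise penalty, the choice to count each undirected edge once, and the brute-force search are all assumptions made for this example and are not the inference procedure of the paper.

```python
import math
from itertools import product


def log_score(labels, unary, edges, pairwise):
    """Log of the unnormalized product of unary and pairwise potentials."""
    total = sum(math.log(unary[i][z]) for i, z in enumerate(labels))
    total += sum(math.log(pairwise(labels[i], labels[j])) for i, j in edges)
    return total


def brute_force_map(label_set, unary, edges, pairwise):
    """Exhaustive search over all labelings; feasible only for toy graphs."""
    n = len(unary)
    return max(product(label_set, repeat=n),
               key=lambda labels: log_score(labels, unary, edges, pairwise))


# Two product-space labels and hypothetical unary scores for three supervoxels.
Z = [("cat", "walking"), ("dog", "walking")]
unary = [
    {Z[0]: 0.90, Z[1]: 0.10},
    {Z[0]: 0.45, Z[1]: 0.55},   # weak local evidence for dog-walking
    {Z[0]: 0.80, Z[1]: 0.20},
]
edges = [(0, 1), (1, 2)]


def pairwise(a, b):
    # Simplified Potts term with a constant penalty for a label change.
    return 1.0 if a == b else 0.5


best = brute_force_map(Z, unary, edges, pairwise)
node_wise = tuple(max(u, key=u.get) for u in unary)
```

Taking the best label at each node separately flips the middle node to dog-walking, while the pairwise term in the joint score keeps the whole chain labeled cat-walking.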

Bilayer Model Given the actor nodes $\mathbf{x}$ and action nodes
$\mathbf{y}$, the bilayer model connects each pair of random variables
$\{(x_i, y_i)\}_{i=1}^{n}$ with an edge that encodes the potential
function for the tuple $[x_i, y_i]$, directly capturing the covariance
across the actor and action labels. We have

$$
\begin{aligned}
P(\mathbf{x}, \mathbf{y} \mid \mathcal{V})
&= \prod_{i \in \mathcal{V}} P(x_i, y_i)
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} P(x_i, x_j)\, P(y_i, y_j) \\
&\propto \prod_{i \in \mathcal{V}} \phi_i(x_i)\, \psi_i(y_i)\, \xi_i(x_i, y_i) \cdot
   \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)} \phi_{ij}(x_i, x_j)\, \psi_{ij}(y_i, y_j) ,
\end{aligned}
\tag{5}
$$

where $\phi_\cdot$ and $\psi_\cdot$ are defined as earlier, and
$\xi_i(x_i, y_i)$ is a learned potential function over the product space
of labels, which can be exactly the same as $\varphi_i$ in Eq. 4 above
or a compatibility term like the contrast-sensitive Potts model, Eq. 3
above. We choose the former in this paper. Fig. 4 (c) illustrates this
model. We note that additional links can be constructed by connecting
corresponding edges between neighboring nodes across layers and encoding
the occurrence among the bilayer edges, such as the joint object class
segmentation and dense stereo reconstruction model in Ladicky et al.
[31]. However, their model is not directly suitable here.

Trilayer Model So far we have introduced three baseline formulations in
Eq. 1 for semantic actor-action segmentation that relate the actor and
action terms in different ways. The naive Bayes model (Eq. 2) does not
consider any relationship between actor $\mathbf{x}$ and action
$\mathbf{y}$ variables. The joint product space model (Eq. 4) combines
features across actors and actions as well as inter-node interactions in
the neighborhood of an actor-action node. The bilayer model (Eq. 5) adds
actor-action interactions among separate actor and action nodes, but it
does not consider how these interactions vary spatiotemporally.

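Therefore, we introduce a new trilayer model that explicitly models such
variations (see Fig. 4d) by combining nodes $\mathbf{x}$ and
$\mathbf{y}$ with the joint product space nodes $\mathbf{z}$:

$$
\begin{aligned}
P(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid \mathcal{V})
&= P(\mathbf{x} \mid \mathcal{V})\, P(\mathbf{y} \mid \mathcal{V})\, P(\mathbf{z} \mid \mathcal{V})
   \prod_{i \in \mathcal{V}} P(x_i, z_i)\, P(y_i, z_i) \\
&\propto \prod_{i \in \mathcal{V}} \phi_i(x_i)\, \psi_i(y_i)\, \varphi_i(z_i)\,
   \mu_i(x_i, z_i)\, \nu_i(y_i, z_i) \cdot \\
&\qquad \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{E}(i)}
   \phi_{ij}(x_i, x_j)\, \psi_{ij}(y_i, y_j)\, \varphi_{ij}(z_i, z_j) ,
\end{aligned}
\tag{6}
$$

where we define

$$
\begin{aligned}
\mu_i(x_i, z_i) &=
\begin{cases}
w(y_i' \mid x_i) & \text{if } x_i = x_i' \text{ for } z_i = [x_i', y_i'] \\
0 & \text{otherwise}
\end{cases} \\
\nu_i(y_i, z_i) &=
\begin{cases}
w(x_i' \mid y_i) & \text{if } y_i = y_i' \text{ for } z_i = [x_i', y_i'] \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{7}
$$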


Terms $w(y_i' \mid x_i)$ and $w(x_i' \mid y_i)$ are classification
scores of conditional classifiers, which are explicitly trained for this
trilayer model. These conditional classifiers are the main reason for
the increased performance found in this method: separate classifiers for
the same action conditioned on the type of actor are able to exploit the
characteristics unique to that actor-action tuple. For example, when we
train a conditional classifier for action eating given actor adult, we
use all other actions performed by adult as negative training samples.
Therefore our trilayer model considers all relationships in the
individual actor and action spaces as well as the joint product space.
In other words, the previous three baseline models are all special cases
of the trilayer model. It can be shown that the solution
$(\mathbf{x}^*, \mathbf{y}^*, \mathbf{z}^*)$ maximizing Eq. 6 also
maximizes Eq. 1 (see Appendix).

               Classification Accuracy      Mean Average Precision
Model          Actor    Action   <A, A>     Actor    Action   <A, A>
Naive Bayes    70.51    74.40    56.17      76.85    78.29    60.13
JointPS        72.25    72.65    61.66      76.81    76.75    63.87
Trilayer       75.47    75.74    64.88      78.42    79.27    66.86

Table 1. Single-label and multiple-label actor-action recognition in the
three settings: independent actor and action models (naive Bayes), joint
actor-action models in a product-space and the trilayer model. The
scores are not comparable along the columns (e.g., the space of
independent actors and actions is significantly smaller than that of
actor-action tuples); the point of comparison is along the rows where we
find the joint model to outperform the independent models when
considering both actors and actions. <A, A> denotes evaluating in the
joint actor-action product-space.
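
The consistency potentials of Eq. 7 can be sketched in Python as below. This is only an illustration: the lookup tables stand in for the scores of the conditional classifiers, and their names, key layout, and numeric values are hypothetical.

```python
def mu(x_i, z_i, w_action_given_actor):
    """Actor-side potential: conditional score if z_i agrees with x_i, else 0."""
    x_prime, y_prime = z_i
    if x_i != x_prime:
        return 0.0
    return w_action_given_actor[(y_prime, x_i)]


def nu(y_i, z_i, w_actor_given_action):
    """Action-side potential: conditional score if z_i agrees with y_i, else 0."""
    x_prime, y_prime = z_i
    if y_i != y_prime:
        return 0.0
    return w_actor_given_action[(x_prime, y_i)]


# Hypothetical conditional classifier scores, keyed as (predicted, given).
w_action_given_actor = {("eating", "adult"): 0.7, ("walking", "adult"): 0.2}
w_actor_given_action = {("adult", "eating"): 0.6, ("bird", "eating"): 0.3}

z = ("adult", "eating")
consistent_actor = mu("adult", z, w_action_given_actor)
inconsistent_actor = mu("bird", z, w_action_given_actor)
consistent_action = nu("eating", z, w_actor_given_action)
inconsistent_action = nu("walking", z, w_actor_given_action)
```

The zero branch ties the three layers together: a product-space label contributes only when it agrees with the actor label and the action label chosen for the same node.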

5.  Experiments 

We thoroughly study each of the instantiations of the actor-action
understanding problem with the overarching goal of assessing if the
joint modeling of actor and action improves performance over modeling
each of them independently, despite the large space. We follow the
training and testing splits discussed in Sec. 3; for assigning a single
label to a video for the single-label actor-action recognition, we
choose the label associated with the query for which we searched and
selected that video from YouTube.

5.1.  Single-Label  Actor-Action  Recognition 

Following the typical action recognition setup, e.g., [33], we use the
state-of-the-art dense trajectory features (trajectories, HoG, HoF, MBHx
and MBHy) [58] and train a set of 1-versus-all SVM models (with
RBF-$\chi^2$ kernels from LIBSVM [6]) for the label sets of actors,
actions and joint actor-action labels. Specifically, when training the
eating classifier, the other seven actions are negative examples; when
we train the bird-eating classifier, we use the 35 other actor-action
labels as negative examples.

Table 1-left shows the classification accuracy of the naive
Bayes, joint product space and trilayer models, in terms of
classifying actor, action and actor-action labels. To evaluate
the joint actor-action labels (the <A, A> columns) for the
naive Bayes models, we train the actor and action classifiers
independently, apply them to the test videos independently and
then score them together (i.e., a video is correct if and only
if both actor and action are correct). We observe that the
independent model for action outperforms the joint product
space model for action; this can be explained by the regularity
across different actors for the same action, which the naive
Bayes model can exploit but which results in more inter-class
overlap in the joint product space. For example, a cat-running
and a dog-running have similar signatures in space-time: the
naive Bayes model does not need to distinguish between these
two, but the joint product space does. However, we find that
when we consider both the actor and action in evaluation, it is
clearly beneficial to jointly model them. This phenomenon
occurs in all of our experiments. Finally, the trilayer model
outperforms the other two models on the individual actor and
action tasks as well as on the joint actor-action task. The
reason is that the trilayer model incorporates both types of
relationships that are separately modeled in the naive Bayes
and joint product space models.
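
The joint scoring rule for the independent classifiers (a video counts as correct only when both its actor and its action are correct) can be sketched as follows. The function names and the toy predictions are hypothetical and serve only to illustrate why the joint accuracy can be lower than either individual accuracy.

```python
def accuracy(pred, truth):
    """Fraction of positions where prediction and ground truth agree."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def joint_accuracy(actor_pred, action_pred, actor_true, action_true):
    """A video is correct iff both its actor and its action are correct."""
    pairs_pred = list(zip(actor_pred, action_pred))
    pairs_true = list(zip(actor_true, action_true))
    return accuracy(pairs_pred, pairs_true)

# Toy predictions from two independently trained classifiers.
actor_true = ["cat", "dog", "bird", "adult"]
action_true = ["running", "running", "flying", "walking"]
actor_pred = ["dog", "dog", "bird", "adult"]
action_pred = ["running", "running", "flying", "running"]

actor_acc = accuracy(actor_pred, actor_true)        # 3 of 4 correct
action_acc = accuracy(action_pred, action_true)     # 3 of 4 correct
joint_acc = joint_accuracy(actor_pred, action_pred,
                           actor_true, action_true)  # 2 of 4 correct
```

Because the actor error and the action error fall on different videos, each individual accuracy is 0.75 while the joint accuracy drops to 0.5.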

5.2. Multiple-Label Actor-Action Recognition

For the multiple-label case, we use the same dense trajectory
features as in Sec. 5.1, and we again train 1-versus-all SVM
models for the label sets of actor, action and actor-action
pairs, but with a different training regimen to capture the
multiple-label setting. For example, when training the adult
classifier, we use all videos containing any adult actor as
positive examples, regardless of the other actors that coexist
in those videos, and we use the rest of the videos as negative
examples. For evaluation, we adapt the approach from HOHA2
[40]. We treat multiple-label actor-action recognition as a
retrieval problem and compute mean average precision (mAP)
given the classifier scores. Table 1-right shows the
performance of the three methods on this task. Again, we
observe that the joint product space has higher mAP than naive
Bayes for the joint actor-action evaluation. We also observe
that the trilayer model further improves the scores, following
the same trend as in the single-label case.

However, we also note a large improvement in both individual
tasks from the trilayer model. This implies that the “side”
information of the actor when doing action recognition (and
vice versa) provides useful information to improve the
inference task, thereby answering the core question in the
paper.
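
The retrieval-style evaluation in this section can be illustrated with a small mAP computation. This sketch assumes the standard non-interpolated average precision; the exact HOHA2 protocol may differ in details, and the score table and label sets below are made up for the example.

```python
def average_precision(scores, relevant):
    """Non-interpolated AP: mean of precision at each relevant rank."""
    ranked = sorted(zip(scores, relevant), key=lambda sr: sr[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, rel) in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_table, video_labels):
    """Average the per-label AP; a video is relevant if it carries the label."""
    aps = []
    for label, scores in score_table.items():
        relevant = [label in labs for labs in video_labels]
        aps.append(average_precision(scores, relevant))
    return sum(aps) / len(aps)

# Each video may carry several labels (multiple-label setting).
video_labels = [{"adult", "dog"}, {"adult"}, {"cat"}, {"dog", "cat"}]
# Hypothetical classifier scores, one list per label, one score per video.
score_table = {
    "adult": [0.9, 0.4, 0.6, 0.1],
    "dog": [0.8, 0.1, 0.2, 0.7],
    "cat": [0.1, 0.2, 0.9, 0.3],
}
map_score = mean_average_precision(score_table, video_labels)
```

Here the dog and cat classifiers rank their positives first (AP of 1), while the adult classifier ranks one negative above a positive (AP of 5/6), giving an mAP of 17/18.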

5.3.  Actor-Action  Semantic  Segmentation 

State-of-the-Art Pixel-Based Segmentation. We first apply the
state-of-the-art robust P^N model [29] at the pixel level; we
use the authors' supplied code off-the-shelf as a baseline.
The average per-class performance is 13.74% for the joint
actor-action task, 47.2% for actor and 34.49% for action. We
suspect that modeling at the pixel and superpixel level cannot
capture the motion changes of actions well, which explains why
the actor score is high but the other scores are comparatively
lower. The robust P^N model could be generalized to fit within
our framework, which we leave for future work. We use
supervoxel segmentation and extract spatiotemporal features
for assessing the various models posed for actor-action
semantic segmentation.

Table 2-left. Per-class accuracy (%) for each joint actor-action label (BK is background).

Label        Naive Bayes  JointPS  Conditional  Bilayer  Trilayer
BK                  79.5     75.1         79.5     79.7      78.5

bird-climb          21.0     23.0         23.2     24.5      28.1
bird-eat             6.2     15.5          8.4     13.3      18.2
bird-fly            28.7     36.0         40.7     40.8      55.3
bird-jump           17.3     19.2         25.4     13.0      20.3
bird-roll           28.3     26.6         30.5     35.4      42.5
bird-walk            2.8      7.5          7.5      7.0       9.0
bird-none           29.3      0.0          0.0      0.0       0.0

cat-climb           28.2     19.4         26.0     32.7      33.1
cat-eat             24.3     24.6         30.5     32.9      27.2
cat-jump             1.6      4.1          8.0      1.1       6.1
cat-roll            38.2     32.4         31.7     38.0      49.8
cat-run             43.6     28.5         53.3     37.0      48.5
cat-walk             1.0      7.5          9.1      7.5       6.6
cat-none             4.4      0.5          0.0      0.1       0.0

dog-crawl            6.1     10.9          7.4      2.5       9.9
dog-eat             13.2     24.2         16.2     22.8      31.0
dog-jump             5.3      2.1          3.1      2.4       2.0
dog-roll            21.9     21.1         24.6     35.9      27.6
dog-run             35.9     21.2         29.3     27.0      23.6
dog-walk            25.8     38.2         53.6     29.6      39.4
dog-none             4.3      0.0          0.0      0.0       0.0

adult-climb         21.5     23.1         18.5     27.2      33.1
adult-crawl         30.4     59.3         43.1     49.6      59.8
adult-eat           21.5     44.0         36.3     51.6      49.8
adult-jump          11.3     17.5         25.4     25.1      19.9
adult-roll           5.0     17.6         17.4     28.4      27.6
adult-run           18.1     34.6         31.8     27.9      40.2
adult-walk          11.5     28.4         30.7     39.2      31.7
adult-none          25.8     21.4         12.1      0.6      24.6

baby-climb          21.6     18.3         26.5     13.2      20.4
baby-crawl          23.5     24.0         20.4     25.4      21.7
baby-roll           20.5     28.1         36.7     44.0      39.3
baby-walk            8.6     17.2         13.9     24.0      25.3
baby-none            7.4      0.6          5.6      0.0       0.0

ball-fly             2.9      0.0          3.7      0.3       1.0
ball-jump           13.6      6.5         16.2     10.3      11.9
ball-roll            6.6      4.7         21.4      6.0       6.1
ball-none            8.6      2.8          9.0      0.0       0.0

car-fly             10.0     13.2         27.7     20.9      24.4
car-jump            71.2     74.7         77.6     76.8      75.9
car-roll            22.2     43.9         43.5     37.2      44.3
car-run              5.5     30.5         37.2     39.6      48.3
car-none            13.7      8.1          1.7      0.5       2.4

Table 2-right. Average per-class accuracy (%) in summary.

Unary terms only
Model          Actor  Action  <A, A>
Naive Bayes    43.02   40.08   16.35
JointPS        40.89   38.50   20.61
Conditional    43.02   41.19   22.55
Bilayer        43.02   40.08   16.35
Trilayer       43.08   41.61   22.59

Full model
Model          Actor  Action  <A, A>
Naive Bayes    44.78   42.59   19.28
JointPS        41.96   40.09   21.73
Conditional    44.78   41.88   24.19
Bilayer        44.46   43.62   23.43
Trilayer       45.70   46.96   26.46

Table 2. Average per-class semantic segmentation accuracy in percentage of joint actor-action labels for all models (for individual classes,
left, and in summary, right). The leading scores of each label are displayed in bold font. The summary scores on the right indicate that
the trilayer model, which considers the action and actor models alone as well as the actor-action product space, performs best.


Supervoxel Segmentation and Features. We use TSP [7] to
obtain supervoxel segmentations due to its strong performance
on the supervoxel benchmark [60]. In our experiments, we set
k = 400, yielding about 400 supervoxels touching each frame.
We compute histograms of textons and dense SIFT descriptors
over each supervoxel volume, dilated by 10 pixels. We also
compute color histograms in both RGB and HSV color spaces and
dense optical flow histograms. We extract feature histograms
from the entire supervoxel 3D volume, rather than a single
representative superpixel [57]. Furthermore, we inject the
dense trajectory features [58] into supervoxels by assigning
each trajectory to the supervoxels it intersects in the video.

Frames in A2D are sparsely labeled; to obtain a supervoxel’s
groundtruth label, we look at all labeled frames in a video
and take a majority vote over intersecting labeled pixels. We
train sets of 1-versus-all SVM classifiers (linear kernels)
for actor, action, and actor-action as well as conditional
classifiers separately. The parameters of the graphical model
are tuned by empirical search, and loopy belief propagation is
used for inference. The inference output is a dense labeling
of video voxels in space-time, but, as our dataset is sparsely
labeled in time, we compute the average per-class segmentation
accuracy only against those frames for which we have
groundtruth labels. We choose average per-class accuracy over
global accuracy because our goal is to compare actor and
action rather than full video labeling.
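
Two steps in the paragraph above lend themselves to a short sketch: the majority vote that gives a supervoxel its groundtruth label, and the difference between average per-class accuracy and global accuracy. The helper names and toy data are hypothetical; the example only illustrates why a dominant background class inflates global accuracy.

```python
from collections import Counter, defaultdict

def supervoxel_label(pixel_labels):
    """Majority vote over the labeled pixels that a supervoxel intersects."""
    return Counter(pixel_labels).most_common(1)[0][0]

def per_class_and_global_accuracy(pred, truth):
    """Return (average per-class accuracy, global accuracy)."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, t in zip(pred, truth):
        total[t] += 1
        correct[t] += int(p == t)
    per_class = sum(correct[c] / total[c] for c in total) / len(total)
    global_acc = sum(correct.values()) / len(truth)
    return per_class, global_acc

vote = supervoxel_label(["dog-running", "dog-running", "background"])

# A labeling that predicts background everywhere looks good globally
# but poor per class.
truth = ["background"] * 8 + ["dog-running"] * 2
pred = ["background"] * 10
per_class, global_acc = per_class_and_global_accuracy(pred, truth)
```

In this toy case the global accuracy is 0.8 although the dog-running class is never recovered, whereas the average per-class accuracy is 0.5.
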

Evaluation. Table 2-right shows the overall performance of the
different methods. The upper part reports results with only
the unary terms and the lower part reports the full-model
performance. We evaluate not only the actor-action pairs but
also the individual actor and action tasks. The conditional
model is a variation of the bilayer model with different
aggregation: we infer the actor label first, then the action
label conditioned on the actor. Note that the bilayer model
has the same unary scores as the naive Bayes model (using the
actor and action unary outputs independently) and the actor
unary of the conditional model is the same as that of the
naive Bayes model (followed by the conditional classifier for
action).
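
The two-stage aggregation of the conditional model (actor first, then action given the actor) can be sketched at the level of unary scores. This is a simplified illustration that ignores the pairwise terms of the graphical model; the score dictionaries and the function name are hypothetical.

```python
def conditional_predict(actor_scores, action_scores_given_actor):
    """Pick the best actor, then the best action among that actor's classifiers."""
    actor = max(actor_scores, key=actor_scores.get)
    action_scores = action_scores_given_actor[actor]
    action = max(action_scores, key=action_scores.get)
    return actor, action

# Hypothetical unary scores for one supervoxel.
actor_scores = {"bird": 0.7, "car": 0.3}
action_scores_given_actor = {
    "bird": {"flying": 0.6, "eating": 0.4},
    "car": {"rolling": 0.8, "flying": 0.2},
}
prediction = conditional_predict(actor_scores, action_scores_given_actor)
```

Because the action is chosen only among the classifiers trained for the selected actor, this scheme cannot output a tuple such as car-eating that has no conditional classifier.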

Over all models, the naive Bayes model performs worst, which
is expected as it does not encode any interactions between the
two label sets. We observe that the conditional model has
better action unary and actor-action scores, which indicates
that knowing actors can help with action inference. We also
observe that the bilayer model has a poor unary performance of
16.35% (actor-action), the same as naive Bayes, but the full
model improves dramatically to 23.43%, which suggests that the
performance boost again comes from the interaction of actor
and action nodes in the full bilayer model. We also observe
that the full trilayer model has not only much better
performance in the joint actor-action task, but also better
scores for the individual actor and action tasks, as it is the
only model considered that incorporates classifiers in both
individual actor and action tasks and also in the joint space.

Table 2-left shows the comparison of quantitative performance
for specific actors and actions. We observe that the trilayer
model has leading scores for more actor-action tuples than the
other models. The trilayer model has significant improvement
on labels such as bird-flying, adult-running and cat-rolling.
We note the systematic increase in performance as more complex
actor-action variable interactions are included. We also note
that the tuples with none action are sampled with greater
variation than the action classes (Fig. 2), which contributes
to the poor performance of none over all actors. Interestingly,
the naive Bayes model has relatively better performance on the
none action classes. We suspect that the label variation for
none leads to high entropy over its classifier density and
hence, under joint modeling, the actor inference pushes the
action variable away from the none action class.

Fig. 5 shows example segmentations. Recall that the naive
Bayes model considers the actor and action labeling problems
independently of each other. Therefore, the baby-rolling in
the second video is assigned the actor label dog and the
action label rolling, since the relationship between actor and
action is not considered. The bilayer model partially recovers
the baby label, whereas the trilayer model successfully
recovers the baby-rolling label, due to the modeling of
inter-node relationships in the joint actor-action space of
the trilayer model. We also visualize more example outputs of
the trilayer model in Fig. 6. Note that the fragmented
segmentation in the ball video is due to poor supervoxel
segmentation in the pre-processing step. We also show trilayer
failure cases in the bottom row of Fig. 6, which are due to
weak cross-class visual evidence.

Figure 5. Comparative example of semantic segmentation results. Only two frames are sampled from each dense video output.

Figure 6. Example results from the trilayer model (upper are good,
lower are failure cases). Panels are shown in pairs: ground truth, then trilayer output.

6.  Discussion  and  Contributions 

Our thorough assessment of all instantiations of the
actor-action understanding problem, at both the coarse
video-recognition level and the fine semantic segmentation
level, provides strong evidence that the joint modeling of
actor and action improves performance over modeling each of
them independently. We find that for both individual actor and
action understanding and joint actor-action understanding, it
is beneficial to jointly consider actor and action. A proper
modeling of the interactions between actor and action results
in dramatic improvement over the naive Bayes and joint product
space baselines, as we observe from the bilayer and trilayer
models.

Our paper set out with two goals: first, we sought to motivate
and develop a new, more challenging, and more relevant
actor-action understanding problem, and second, we sought to
assess whether joint modeling of actors and actions improves
performance for this new problem. We achieved these goals
through three contributions:

1. New actor-action understanding problem and dataset.

2. Thorough evaluation of actor-action recognition and
semantic segmentation problems using state-of-the-art features
and models. The experiments consistently demonstrate a benefit
for jointly modeling actors and actions.

3. A new trilayer approach to recognition and semantic
segmentation that combines both the independent actor and
action variations and product-space interactions.

Our full dataset, computed features, codebase, and evaluation
regimen are released (see footnote 2) to support further
inquiry into this new and important problem in video
understanding.

Appendix 

We show that a solution $(x^*, y^*, z^*)$ maximizing Eq. 6 also maximizes
Eq. 1. First, to simplify Eq. 6, we set $z = [x, y]$. Therefore
we can obtain:

$$P(x, y, [x, y] \mid V) = P'(x, y \mid V), \qquad (8)$$

where $P'$ keeps the node-level factors of Eq. 6 (a product over $i \in V$) and the pairwise factors

$$\prod_{i \in V} \prod_{j \in \mathcal{E}(i)} \phi_{ij}(x_i, x_j)\, \psi_{ij}(y_i, y_j)\, \varphi_{ij}(z_i, z_j),$$

each evaluated at $z_i = [x_i, y_i]$.

Theorem 1. Let $(x^*, y^*) = \arg\max_{x, y} P'(x, y \mid V)$ be
the optimal solution of Eq. 8; then $(x^*, y^*, [x^*, y^*]) =
\arg\max_{x, y, z} P(x, y, z \mid V)$ is an optimal solution of Eq. 6.

Proof. First, by construction, when $z = [x, y]$ then $P'(x, y \mid V) =
P(x, y, z \mid V)$. The rest of the proof follows in two parts:

•  Assume that $z = [x, y]$ in the optimal solution of Eq. 6.
Then: $\arg\max P(x, y, z \mid V) = \arg\max_{z = [x, y]} P(x, y, z \mid V)
= \arg\max P'(x, y \mid V)$.

•  Assume that in the optimal solution $(x, y, z)$ we have $z =
[x', y'] \neq [x, y]$. Thus, there exists some node $i$ with $x'_i \neq x_i$ or $y'_i \neq y_i$.
According to the definition of $\mu_i(x_i, z_i)$ and $\nu_i(y_i, z_i)$ in
Eq. 7, we would obtain $\mu_i(x_i, z_i) = 0$ or $\nu_i(y_i, z_i) = 0$,
which results in $P(x, y, z \mid V) = 0$, which contradicts the optimality of $(x, y, z)$.

Therefore, we prove the Theorem. □
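
The statement of Theorem 1 can be checked by brute force on a tiny example. The sketch below is only an illustration under stated assumptions: it assumes the trilayer score is a product of node terms, coupling terms and pairwise terms, with the coupling terms equal to zero whenever z_i disagrees with x_i or y_i (the property the proof relies on). All potentials are random positive numbers, and every name in the code is hypothetical.

```python
import itertools
import random

random.seed(0)
ACTORS, ACTIONS, NODES = ["cat", "dog"], ["run", "eat"], [0, 1]
PAIRS = list(itertools.product(ACTORS, ACTIONS))  # joint labels [x_i, y_i]
EDGES = [(0, 1)]

def table(keys):
    """Random strictly positive potential table (assumed values)."""
    return {k: random.uniform(0.1, 1.0) for k in keys}

phi = [table(ACTORS) for _ in NODES]    # actor node terms
psi = [table(ACTIONS) for _ in NODES]   # action node terms
vphi = [table(PAIRS) for _ in NODES]    # joint node terms
w = table(PAIRS)                        # positive coupling weights
phi_e = table(itertools.product(ACTORS, ACTORS))
psi_e = table(itertools.product(ACTIONS, ACTIONS))
vphi_e = table(itertools.product(PAIRS, PAIRS))

def mu(x_i, z_i):
    """Coupling term: zero unless the actor part of z_i matches x_i."""
    return w[z_i] if z_i[0] == x_i else 0.0

def nu(y_i, z_i):
    """Coupling term: zero unless the action part of z_i matches y_i."""
    return w[z_i] if z_i[1] == y_i else 0.0

def score(x, y, z):
    """Unnormalized trilayer score under the assumed factorization."""
    s = 1.0
    for i in NODES:
        s *= phi[i][x[i]] * psi[i][y[i]] * vphi[i][z[i]]
        s *= mu(x[i], z[i]) * nu(y[i], z[i])
    for i, j in EDGES:
        s *= phi_e[(x[i], x[j])] * psi_e[(y[i], y[j])] * vphi_e[(z[i], z[j])]
    return s

def labelings(labels):
    return itertools.product(labels, repeat=len(NODES))

# Maximize over all (x, y, z), and over (x, y) with z tied to [x, y].
best_full = max(
    ((x, y, z) for x in labelings(ACTORS)
     for y in labelings(ACTIONS) for z in labelings(PAIRS)),
    key=lambda t: score(*t),
)
best_tied = max(
    ((x, y) for x in labelings(ACTORS) for y in labelings(ACTIONS)),
    key=lambda t: score(t[0], t[1], tuple(zip(*t))),
)
```

In this toy setting the unconstrained maximizer has its z component equal to the pairing of its x and y components, and its (x, y) part coincides with the maximizer of the tied problem, as the theorem predicts.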


2 http://web.eecs.umich.edu/~jjcorso/r/a2d/